---
title: Feature details
description: How to work with a feature on the Data page, to view its details and also (in some cases) modify its type.

---

# Feature details {: #feature-details }

The **Data** page displays tags to indicate a variety of information that DataRobot uncovered while computing EDA1. You can also [click a feature name](#view-feature-details) to view its details.

## Data page informational tags {: #data-page-informational-tags }

![](images/data-tags.png)

Informational tags on the **Data** page include:

|  Tag  |   Description  |
|-------|----------------|
|  Duplicate   | A feature column is duplicated in the ingest dataset. |
|  Empty  | Column contains no values.   |
|  Few values   | Too few values, relative to the size of the dataset, for DataRobot to extrapolate meaningful information from the feature. Not an indicator of the number of unique values, but instead domination of a single value, making the feature inappropriate for modeling. Specifically:<ul><li> A numeric with no missing values and only one unique value. </li><li> A variable in which >99.9% is the same value </li></ul> |
| Too many values  | Too many values, relative to the size of the dataset, for DataRobot to extrapolate meaningful information from the feature. For categorical features, the label is applied if: `[ number of unique values ] > [ number of rows ] / 2` |
| Reference ID&ast;  | Column contains reference IDs (unique sequential numbers). |
| Associated with Target  | Column was derived from target column. |
| [Target leakage](quality-check#target-leakage) | Indicates a feature whose value cannot be known at the time of prediction.  |

??? note "&ast; Reference ID calculations"
     A feature is considered a reference ID if *all* of the following apply:

     * The feature is an integer and not a date.
     * The number of rows in the data is greater than 2000.
     * Feature values are unique (`[ number of unique values] = [number of rows]`)
     * Feature values are "compact." That is, the highest and lowest values are not more than `100 * rows` apart.
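
The tag rules above can be sketched in a few lines of Python. This is an illustrative reading of the documented conditions, not DataRobot's implementation: the function names are hypothetical, missing values are assumed to be represented as `None`, the 99.9% share is assumed to be measured against all rows, and date detection is left out.

```python
from collections import Counter


def is_few_values(values):
    """One unique value with nothing missing, or one value in more than 99.9% of rows."""
    present = [v for v in values if v is not None]
    if not present:
        return False  # all-missing columns get the separate "Empty" tag
    counts = Counter(present)
    if len(present) == len(values) and len(counts) == 1:
        return True
    top_count = counts.most_common(1)[0][1]
    # Assumption: the share is measured against all rows, missing included.
    return top_count / len(values) > 0.999


def is_too_many_values(values):
    """Categorical rule: the number of unique values exceeds half the row count."""
    return len(set(values)) > len(values) / 2


def is_reference_id(values):
    """Checks the integer, row count, uniqueness, and compactness conditions."""
    n_rows = len(values)
    if n_rows <= 2000:
        return False
    # bool is excluded explicitly because it subclasses int in Python.
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return False
    if len(set(values)) != n_rows:
        return False
    return max(values) - min(values) <= 100 * n_rows
```

For example, `is_reference_id(list(range(1, 3002)))` returns `True`, while the same number of unique integers spaced 1000 apart fails the compactness condition.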

## View feature details {: #view-feature-details }

Once DataRobot displays features on the **Data** page, you can click a feature name to view its details and also (in some cases) modify its type. The options available are dependent on variable type:

|  Option   | Description |  Variable Type  |
|-----------|-------------|-----------------|
| _Tabs_ | :~~: | :~~: |
| [Histogram](#histogram-chart) | Buckets numeric feature values into equal-sized ranges to show a rough distribution of the variable. | numeric, summarized categorical, [multicategorical](multilabel#histogram-tab) |
| [Frequent Values](#frequent-values-chart) | Plots the counts of each individual value for the most frequent values of a feature. If there are more than 10 categories, DataRobot displays values that account for 95% of the data; the remaining 5% of values are bucketed into a single "All Other" category. | numeric, categorical, text, boolean  |
| [Table](#table-tab)  | Provides a table of feature values and their occurrence counts. Note that if a displayed value contains a leading space, DataRobot adds a "leading space" tag to flag it. This helps clarify why a particular value may show twice in the histogram (for example, `36 months` with and without a leading space are both represented). | numeric, categorical, text, boolean, summarized categorical, multilabel |
| [Illustration](#illustration-table) | Shows how summarized categorical data&mdash;features that host a collection of categories&mdash;is represented as a feature. See also the [summarized categorical tab differences](#summarized-categorical-features) for information on Overview and Histogram. | summarized categorical  |
| [Category Cloud](analyze-insights#category-cloud-insights) | After EDA2 completes, displays the keys most relevant to their corresponding feature in Word Cloud format. This is the same Word Cloud that is available from the Category Cloud on the **Insights** page. From the **Data** page you can more easily compare Clouds across features; on the **Insights** page you can compare Word Clouds for a project's categorically-based models. | summarized categorical  |
| [Feature Statistics](multilabel#feature-statistics-tab) | Reports overall multilabel dataset characteristics, as well as pairwise statistics for pairs of labels and the occurrence percentage of each label in the dataset. | multilabel |
| [Over Time (time-aware only)](ts-leaderboard#understand-a-features-over-time-chart)  | Identifies trends and potential gaps in data by displaying, for both the original modeling data and the derived data, how a feature changes over the primary date/time feature.  | numeric, categorical, text, boolean  |
| Feature Lineage  [(time series)](ts-leaderboard#feature-lineage-tab) or  [(Feature Discovery)](fd-gen#project-data-tab) | Provides a visual description of how a derived feature was created. | numeric, categorical, text, boolean |
| _Actions_ | :~~: | :~~: |
| [Var Type Transform](feature-transforms#variable-type-transformations) | Provides a dialog to modify the variable type. (Not shown if the variable type for this feature was previously transformed.)  | numeric, categorical, text    |
| [Transformation](feature-transforms#create-transformations)   | Shows details for a selected transformed feature and a comparison of the transformed feature with the parent feature. (Applies to transformed features only.) | numeric, boolean  |




!!! note
    The values and displays for a feature may differ between EDA1 and EDA2. For EDA1, the charts represent data straight from the dataset. After you have selected a target and built models, the data calculations may use fewer rows due to, for example, holdout or missing values. Additionally, after EDA2, DataRobot displays [average target values](#average-target-values), which are not calculated during EDA1.

## Histogram chart {: #histogram-chart }

{% include 'includes/histogram-include.md' %}

### Change the distribution and display {: #change-the-distribution-and-display }

DataRobot breaks the data into several bins; the size of each bin depends on the number of rows in your dataset. You can change the number of bins to change the distribution range. The available bin options depend largely on the number of unique values in the dataset. To change the distribution range, use the dropdown:

![](images/histo-bins.png)

For classification projects, you can also (after EDA2) change the basis of the display to fill bins based on the number of rows or percentage of target value. The displays of the histogram and average target value overlay also change to match your selection.

### Display summaries {: #display-summaries }

To see the details of a selected bin, hover over the bin until a popup displays:

![](images/histo-bin-detail.png)

| | Element | Description |
|---|---|---|
| ![](images/icon-1.png) | Value | Displays the bin range located on the X-axis. |
| ![](images/icon-2.png) | Rows | Displays the number of rows in the bin (located on the left Y-axis).|
| ![](images/icon-3.png) | Percentage | Displays the [average target value](#average-target-values) (located on the right Y-axis). |

###  Calculate outliers {: #calculate-outliers }

Outliers, the observation points that lie far from the sample mean, may be the result of data variability. They can also represent data error, in which case you may want to exclude them from the histogram. Outlier detection, run as part of [EDA1](eda-explained) using a combination of heuristics, is strictly a histogram visualization tool and does not influence the modeling process.

Outlier thresholds are generally calculated as two ranges, using the following values:

* `p25` is the first quartile (25th percentile) of the data distribution.
* `p75` is the third quartile (75th percentile) of the data distribution.
* `IQR` is the interquartile range, equal to the third quartile minus the first quartile: `IQR = p75-p25`.

The ranges are then calculated as the first quartile minus IQR (`p25-IQR`) and the third quartile plus IQR (`p75+IQR`). Note that this is a general overview of outlier calculation. Additional calculations are required depending on how these ranges compare to the minimal and maximal values of the data distribution. There are also additional heuristics used for corner cases that cover how DataRobot calculates IQR and the final values of the outlier threshold.
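
As a minimal sketch of the general calculation only, the thresholds can be computed with Python's standard `statistics` module. The function names are hypothetical, the `inclusive` quantile method is an assumption, and none of the additional DataRobot heuristics for corner cases are reproduced here.

```python
import statistics


def outlier_thresholds(values):
    """Return (lower, upper) as p25 - IQR and p75 + IQR."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    p25, _median, p75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = p75 - p25
    return p25 - iqr, p75 + iqr


def split_outliers(values):
    """Separate values inside the thresholds from those outside them."""
    lower, upper = outlier_thresholds(values)
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]
inliers, outliers = split_outliers(sample)
print(outliers)  # [100]
```

Here `p25` is 3.75 and `p75` is 9.25, so the thresholds are -1.75 and 14.75 and only the value 100 falls outside them.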

Check the **Show outliers** box to initiate a calculation identifying the rows containing outliers. DataRobot then re-displays the histogram with outliers included:

![](images/calc-outliers.png)

Check and uncheck the box to switch the histogram display between off (excluding) and on (including) outliers.

Note that DataRobot reshuffles the bin values based on the display. With outliers excluded, each bin covers a narrower range of values and contains a smaller number of rows. When toggled on, each bin contains a greater number of rows because the bin has expanded its range of values.

The bin selection dropdown works as usual, regardless of the outlier display setting.

## Frequent Values chart {: #frequent-values-chart }

The Frequent Values chart is the default display for categorical, text, and boolean features, although it is also available to other feature types. The display is dependent on the results of the [data quality](quality-check#interpret-the-histogram-tab) check. With no data quality issues:

![](images/feature-details.png)

In many cases, you can change the display using the **Sort by** dropdown. By default, DataRobot sorts by frequency (**Number of rows**), from highest to lowest. You can also sort by &lt;<em>feature_name</em>&gt;, which displays either alphabetically or, in the case of numerics, from low to high. The [**Export** link](export-results) allows you to download an image of the Frequent Values chart as a PNG file.

After EDA2 completes, the Frequent Values chart also displays an [average target value](#average-target-values) overlay.
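
The bucketing rule for the Frequent Values display (more than 10 categories, 95% coverage, remainder in "All Other") can be approximated as follows. This is a sketch under stated assumptions: the function name and parameters are hypothetical, and the exact cutoff DataRobot applies when a value straddles the 95% mark is not documented here.

```python
from collections import Counter


def frequent_values(values, max_categories=10, coverage_pct=95):
    """Return (value, count) pairs, bucketing the long tail into "All Other"."""
    counts = Counter(values).most_common()  # sorted by count, highest first
    if len(counts) <= max_categories:
        return counts
    total = len(values)
    shown, covered = [], 0
    for value, count in counts:
        # Integer comparison avoids floating-point rounding at the boundary.
        if covered * 100 >= coverage_pct * total:
            break
        shown.append((value, count))
        covered += count
    remainder = total - covered
    if remainder:
        shown.append(("All Other", remainder))
    return shown
```

With 1000 rows split as 500, 300, and 150 across three values plus 50 one-off values, the result lists the three frequent values followed by `("All Other", 50)`.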

## Summarized categorical features {: #summarized-categorical-features }

The summarized categorical variable type is used for features that host a collection of categories (for example, the count of a product by category or department). If your original dataset does not have features of this type, DataRobot creates them (where appropriate, as described below) as part of EDA2. The summarized categorical variable type offers unique feature details in its [**Overview**](#overview-tab-for-summarized-categorical), [**Histogram**](#histogram-tab-for-summarized-categorical), [**Category Cloud**](#category-cloud-for-summarized-categorical), and [**Table**](#table-tab) tabs.

!!! note
    You cannot use summarized categorical features as your target for modeling.

### Required dataset formatting {: #required-dataset-formatting }

For features to be detected as the summarized categorical variable type (shown in the Var Type column on the **Data** tab), the column in your dataset must be a valid JSON-formatted dictionary:

`{"Key1": Value1, "Key2": Value2, "Key3": Value3, ...}`

* `"Key":` must be a string.
* `Value` must be numeric (an integer or floating point value) and greater than 0.
* Each key requires a corresponding value. If there is no value for a given key, the data will not be usable.
* The column must be JSON-serializable.

The following is an example of a <em>valid</em> summarized categorical column:

`{"Book1": 100, "Book2": 13}`

An <em>invalid</em> summarized categorical column can look like any of the following examples:

* `{'Book1': 100, 'Book2': 12}`

	* The keys are in single quotation marks instead of double quotation marks (not JSON-serializable).

* `{"Book1": "rate", "Book2": "rate1"}`

	* These values are strings instead of positive numeric values.

* `{"Book1", "Book2"}`

	* This example is not in JSON dictionary format.
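
To check a column before upload, the formatting requirements above can be expressed as a small validator. This is a sketch, not DataRobot's detection logic; the function name is hypothetical and it checks one cell at a time.

```python
import json


def is_summarized_categorical(cell):
    """Return True if the cell is a JSON dictionary of string keys to positive numbers."""
    try:
        parsed = json.loads(cell)
    except (TypeError, ValueError):
        return False  # not valid JSON (for example, single-quoted keys)
    if not isinstance(parsed, dict):
        return False  # valid JSON, but not a dictionary
    for value in parsed.values():
        # JSON object keys are always strings, so only the values need checking.
        # bool is excluded explicitly because it subclasses int in Python.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not value > 0:
            return False
    return True


print(is_summarized_categorical('{"Book1": 100, "Book2": 13}'))  # True
print(is_summarized_categorical("{'Book1': 100, 'Book2': 12}"))  # False
```

Running the check over every non-empty cell in a column is one way to find the rows that would prevent the column from being detected as summarized categorical.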

### Overview tab for summarized categorical {: #overview-tab-for-summarized-categorical }

The **Overview** tab presents the top 50 most frequent keys for your feature. Each key displays the percentage of rows that it appears in, its mean, standard deviation, median, min, and max. You can sort the keys by any of these fields. Most of this information is available for other types of features in the columns on the **Data** page, but for summarized categorical features each individual key has its own values for these fields.

![](images/sum-cat-1.png)

| | Element | Description |
|---|---|---|
| ![](images/icon-1.png) | Export | Export the list of keys and their associated values as a PNG. You can choose to include the chart title in the image and edit the filename before you download it.|
| ![](images/icon-2.png) | Page control | Move through pages of listed keys (10 keys per page). |
| ![](images/icon-3.png) | Histogram icon |  Access the histogram for a given key.  |
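
To make the per-key fields concrete, the following sketch computes them from rows that have already been parsed into Python dictionaries. It is illustrative only: the function name is hypothetical, and it assumes each statistic is computed over the rows in which the key appears and that the standard deviation is the population form, neither of which is confirmed by this page.

```python
import statistics
from collections import defaultdict


def key_overview(rows, top_n=50):
    """Summarize each key across rows, most frequent keys first."""
    per_key = defaultdict(list)
    for row in rows:
        for key, value in row.items():
            per_key[key].append(value)
    summary = []
    for key, vals in per_key.items():
        summary.append({
            "key": key,
            "pct_rows": 100 * len(vals) / len(rows),
            "mean": statistics.mean(vals),
            # Assumption: population standard deviation, which is also
            # defined for keys that appear in only one row.
            "std": statistics.pstdev(vals),
            "median": statistics.median(vals),
            "min": min(vals),
            "max": max(vals),
        })
    summary.sort(key=lambda item: item["pct_rows"], reverse=True)
    return summary[:top_n]
```

Sorting by any other field only requires changing the `key` function passed to `sort`.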


### Histogram tab for summarized categorical {: #histogram-tab-for-summarized-categorical }

While most of the functionality for this tab is the same as described in the [working with histograms](#histogram-chart) section above, there are some differences unique to this variable type. The histograms displayed in this tab correspond to the individual labels (keys) of a feature instead of a feature itself. The list of keys can be sorted by percentage of occurrence in the dataset's rows or alphabetically.

![](images/sum-cat-2.png)

| | Element | Description |
|---|---|---|
| ![](images/icon-1.png) |  Search | Searches for labels. |
| ![](images/icon-2.png) | Showing | [Changes the bin distribution](#change-the-distribution-and-display). Select the number of bins to view. |
| ![](images/icon-3.png) | Target values | Sets the basis of the [target value display](#change-the-distribution-and-display).   |
| ![](images/icon-4.png) | Scale Y-axis for large values |  Reduces the number of rows measured in the Y-axis for [large values](#viewing-large-values).|
| ![](images/icon-5.png) | Export | Exports the histogram.   |

!!! note
    DataRobot automatically filters out stopwords when calculating values for the histogram.

### Viewing large values {: #viewing-large-values }

The **Scale the Y-axis for large values** option reduces the number of rows measured in the Y-axis and improves the visualization of larger values&mdash;it is common that large numbers are only represented in a few rows. Resizing the histogram above results in:

![](images/sum-cat-5.png)

By scaling the Y-axis, you can see that the greatest value measured has been greatly reduced. As a result, the number of rows across all values is more evenly represented.

### Category Cloud for summarized categorical {: #category-cloud-for-summarized-categorical }

The **Category Cloud** tab provides insights into [summarized categorical](histogram#summarized-categorical-features) features. It displays as a [word cloud](word-cloud) and shows the keys that are most relevant to their corresponding feature.

![](images/insights-category-cloud.png)

{% include 'includes/category-cloud-include.md' %}

## Illustration table {: #illustration-table }

The **Illustration** tab shows how summarized categorical data is represented as a feature. For example, in the image below, the **Values** column contains five summarized categorical features displayed in JSON dictionary format (selected at random), as described above.

![](images/summ-cat-5.png)

Click **Summary** to display a box that visualizes how categorical values appeared in their initial state, prior to being engineered as summarized categorical features.

![](images/sum-cat-9.png)

## Table tab {: #table-tab }

The **Table** tab, which is the default tab for [multilabel](multilabel) projects, displays a two-column table detailing counts for the top 50 most frequent label sets in the multicategorical feature.

![](images/sum-cat-3.png)

The table lists each key in the **Values** column, and the respective key's count in the **Count** column.
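
A rough equivalent of the value and count listing, including the leading-space flag described for the **Table** option, might look like the following. The function name and the tuple layout are hypothetical and only illustrate the idea.

```python
from collections import Counter


def value_table(values, top_n=50):
    """Return (value, count, has_leading_space) for the most frequent values."""
    rows = []
    for value, count in Counter(values).most_common(top_n):
        has_leading_space = isinstance(value, str) and value.startswith(" ")
        rows.append((value, count, has_leading_space))
    return rows


for row in value_table(["36 months", " 36 months", "36 months", "60 months"]):
    print(row)
```

The two `36 months` entries are counted separately because one of them begins with a space, which is the situation the tag is meant to explain.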

!!! note "Unicode text in the Values column"
    If you are using Unicode text and it appears abnormal in the Values column, make sure your text is UTF-8 encoded.


## Average target values {: #average-target-values }

After EDA2, DataRobot displays orange circles as graph overlays on the Histogram and Frequent Values charts. The circles indicate the average target value for a bin. (These circles are connected for numeric features and not for categorical, since the ordering of categorical variables is arbitrary and histograms display a continuous range of values.)

For example, consider the feature `num_lab_procedures`:

![](images/av-target-value.png)

In this example, there are 846 people who had between 44 and 49.999999 lab procedures. The average target value represented by the circle (in this case, the percent readmitted) is 37.23%. (The orange dots correspond to the right axis of the histogram.)
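
The relationship between bins, row counts, and the average target overlay can be sketched with equal-width bins. This is a simplified illustration using toy data, not the data behind the chart above; the function name is hypothetical and DataRobot's actual bin boundaries may be chosen differently.

```python
def histogram_with_target(values, targets, n_bins):
    """Equal-width bins with a row count and average target value per bin."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width when all values match
    bins = [
        {"start": low + i * width, "end": low + (i + 1) * width,
         "rows": 0, "target_sum": 0.0}
        for i in range(n_bins)
    ]
    for value, target in zip(values, targets):
        # The maximum value is placed in the last bin rather than past it.
        index = min(int((value - low) / width), n_bins - 1)
        bins[index]["rows"] += 1
        bins[index]["target_sum"] += target
    for b in bins:
        b["avg_target"] = b["target_sum"] / b["rows"] if b["rows"] else None
    return bins


# Toy data: six rows with a binary target (1 = readmitted).
for b in histogram_with_target([0, 1, 2, 5, 6, 10], [0, 1, 1, 0, 0, 1], n_bins=2):
    print(b["start"], b["end"], b["rows"], b["avg_target"])
```

Each bin's `rows` corresponds to the left Y-axis and `avg_target` to the orange circle on the right Y-axis.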

### How Exposure changes output {: #how-exposure-changes-output }

If you used the [Exposure](additional#set-exposure) parameter when building models for the project, the **Histogram** and **Frequent Values** tabs display the graphs adjusted to exposure. In this case, the display shows:

![](images/eda2-tooltip-exposure.png)

* The <em>number of rows</em> (1) in each bin.
* The <em>sum of exposure</em> (2) in each bin. That is, the sum of the weights for all rows weighted by exposure.
* The <em>sum of target</em> value divided by the <em>sum of the exposure</em> (3) in a bin.
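
As a small worked example of the three exposure-adjusted quantities for a single bin, consider the sketch below. The function name and the `(target, exposure)` row layout are hypothetical.

```python
def exposure_summary(bin_rows):
    """bin_rows holds (target, exposure) pairs for the rows in one bin."""
    n_rows = len(bin_rows)                                # (1) number of rows
    sum_exposure = sum(exp for _, exp in bin_rows)        # (2) sum of exposure
    sum_target = sum(target for target, _ in bin_rows)
    ratio = sum_target / sum_exposure if sum_exposure else None  # (3)
    return n_rows, sum_exposure, ratio


print(exposure_summary([(2, 1.0), (0, 0.5), (1, 0.5)]))  # (3, 2.0, 1.5)
```

In this bin, three rows carry a total exposure of 2.0 and a total target of 3, giving a target-to-exposure ratio of 1.5.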

### How Weight changes output {: #how-weight-changes-output }

If you set the Weight parameter for a project, DataRobot weights the number of rows and average target values by weight.
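
One common reading of this weighting, shown here as an assumption rather than a confirmed description of DataRobot's calculation, is that the row count for a bin becomes the sum of its row weights and the average target becomes a weighted mean. The function name and row layout are hypothetical.

```python
def weighted_bin_summary(bin_rows):
    """bin_rows holds (target, weight) pairs for the rows in one bin."""
    weighted_rows = sum(weight for _, weight in bin_rows)
    if not weighted_rows:
        return 0, None
    weighted_target = sum(target * weight for target, weight in bin_rows)
    return weighted_rows, weighted_target / weighted_rows


print(weighted_bin_summary([(1, 2.0), (0, 1.0), (1, 1.0)]))  # (4.0, 0.75)
```

With these weights, three physical rows count as 4.0 weighted rows, and the weighted average target is 0.75 instead of the unweighted 0.67.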
